library(tidyverse)
library(janitor)Maria’s Cleaning
Goals of this notebook
The steps we’ll take to prepare our data:
- Download the data
- Import it into our notebook
- Clean up data types and columns
- Export the data for next notebook
Setup
Loading the libraries
Downloading the Data
In this section of the cleaning notebook, we’re loading the VERSA program data. This data accounts for arrests made under Operation Lone Star starting in January 2023.
Here is the summary of the differences between SPURS system and VERSA from the Depart of Public Safety Communications:
“Please note that arrests are now documented on two spreadsheets due to the implementation of a new record management system in 2023 by the Texas Highway Patrol (THP). Prior to January 2023, THP utilized the Enterprise Management (EM) System for traffic enforcement activity (written warnings, citations, and CMV inspections) and SPURS for documenting entrusted property and criminal case reports. Beginning in January of 2023, THP underwent a statewide rollout of the Versaterm Software Suite, completed in August 2023, which serves as THP’s only RMS system for regular traffic enforcement activity, entrusted property, as well as general and arrest offense report entry.”
More information about Operation Lone Star is found in the index page of this website.
The data was obtained by a Public Records Request from KUT.
Importing the Data
Here we’re importing the VERSA data spreadsheet that I already have downloaded to my “data-raw” folder. Then, we’re adding our data to a code chunk. For simplicity it will be called “versa_analysis,” which we will then peak at in order to look at what our data looks like in order to clean.
versa_analysis<- read_csv("data-raw/versa_raw_data.csv") |> clean_names()Rows: 17256 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Arrest Date, County Name, Primary Key, Jurisdiction, Statute, Clas...
dbl (3): Arrest Officer1, Charge Count, Apparent Age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
versa_analysisversa_analysis |> glimpse()Rows: 17,256
Columns: 18
$ arrest_date <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1 <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
Notes for VERSA cleaning:
- “Given three” refers to two middle names, only very few individuals in the list have this listed
- Years are given as “apparent age” changed this to the actual age number to line up to SPURS “Person Age”
Fixing the dates
We want to make sure that our dates are in month, date, year format for consistency in our analysis.
versa_analysis_date <- versa_analysis |>
mutate(
charge_date = mdy(arrest_date)
)
versa_analysis_date |> glimpse()Rows: 17,256
Columns: 19
$ arrest_date <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1 <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
versa_analysis_dateSummary Stats:
versa_analysis_date |> summary() arrest_date county_name primary_key jurisdiction
Length:17256 Length:17256 Length:17256 Length:17256
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
arrest_officer1 statute class severity_code_group
Min. : 6154 Length:17256 Length:17256 Length:17256
1st Qu.:14978 Class :character Class :character Class :character
Median :15752 Mode :character Mode :character Mode :character
Mean :15295
3rd Qu.:16099
Max. :16551
NA's :18
charge charge_count given_one given_three
Length:17256 Min. : 0.000 Length:17256 Length:17256
Class :character 1st Qu.: 1.000 Class :character Class :character
Mode :character Median : 1.000 Mode :character Mode :character
Mean : 1.217
3rd Qu.: 1.000
Max. :60.000
given_two individual_surname ethnicity race
Length:17256 Length:17256 Length:17256 Length:17256
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
gender_code apparent_age charge_date
Length:17256 Min. :1944 Min. :2023-01-01
Class :character 1st Qu.:1986 1st Qu.:2023-06-30
Mode :character Median :1995 Median :2023-10-19
Mean :1993 Mean :2023-10-24
3rd Qu.:2001 3rd Qu.:2024-03-02
Max. :2023 Max. :2024-07-11
NA's :6073
Fixing the age
We want our VERSA data to match up to SPURS when it comes the the “Person Age” data column. In order to do that we subtract this year’s date minus the “apparent_age” column of the VERSA data. This gives us the age in the same numerical format.
versa_analysis_age <- versa_analysis_date |>
mutate(
person_age = 2024- apparent_age
)
versa_analysis_age |> glimpse()Rows: 17,256
Columns: 20
$ arrest_date <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1 <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
$ person_age <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25…
Fixing Names
We want to change the columns for the names of the offenders, so it matches up to the SPURS data set. Theb we want to change the officer column to “officer” and
versa_analysis_names <- versa_analysis_age |>
mutate(
person_first_name = given_one,
person_second_name = given_two,
person_third_name = given_three,
person_last_name = individual_surname,
officer_id = arrest_officer1,
arrest_county = county_name,
code = class,
severity_code = severity_code_group,
person_race_abbr = race,
person_gender_abbr = gender_code,
)
versa_analysis_names |> glimpse()Rows: 17,256
Columns: 30
$ arrest_date <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1 <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
$ person_age <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25…
$ person_first_name <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ person_second_name <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ person_third_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ person_last_name <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ officer_id <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ arrest_county <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ code <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ person_race_abbr <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ person_gender_abbr <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
Create new column to match SPURS race column
Because the SPURS dataset uses the “race” column to identify both ethnicity and race, we created a new column that mirrors the SPURS way of documenting people’s profiles.
versa_analysis_new <- versa_analysis_names |>
mutate(
person_race_abbr = if_else(ethnicity== "H", ethnicity, person_race_abbr),
arrest_state = case_match(jurisdiction, "TX" ~ "TEXAS")
)
versa_analysis_new |> glimpse()Rows: 17,256
Columns: 31
$ arrest_date <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1 <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
$ person_age <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25…
$ person_first_name <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ person_second_name <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ person_third_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ person_last_name <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ officer_id <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ arrest_county <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ code <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ person_race_abbr <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ person_gender_abbr <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ arrest_state <chr> "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXA…
Filter out unnnessary columns
After cleaning the names of the columns, I want to remove the columns with the old column names.I will use the select function to select the columns I don’t want.
versa_analysis_filter <- versa_analysis_new |>
select(-c(apparent_age, given_one, given_two, given_three, individual_surname, arrest_officer1, jurisdiction, county_name, arrest_date, statute, class, severity_code_group, race, gender_code))
versa_analysis_filter |> glimpse()Rows: 17,256
Columns: 17
$ primary_key <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX2023…
$ charge <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WAR…
$ charge_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ ethnicity <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "…
$ charge_date <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 20…
$ person_age <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25,…
$ person_first_name <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", "…
$ person_second_name <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "E…
$ person_third_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ person_last_name <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGUE…
$ officer_id <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 15…
$ arrest_county <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", "…
$ code <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "PC…
$ severity_code <chr> "Versa Misdemeanor Charges", "VT Felony Charges", "…
$ person_race_abbr <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "…
$ person_gender_abbr <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", "…
$ arrest_state <chr> "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXAS…
Removing extra labels in “severity_code” and creating clean column
We noted that Versa and SPURS record charges differently under “severity_code.” We aren’t too sure if the Versa and VT label is equivalent to SPURS labels so we created a new column for cleaning purposes and we won’t remove the original column. This new column will be “severity_code_clean.”
versa_code_clean <- versa_analysis_filter |>
mutate(
severity_code_clean = case_match(severity_code, "Versa Misdemeanor Charges" ~ "Misdemeanor Charges", "VT Felony Charges" ~ "Felony Charges", "VT Federal/Other Charges" ~ "Federal/Other Charges"),
.after = severity_code
)
versa_code_clean Importing cleaned data
versa_code_clean |>
write_rds("data-processed/01-clean-versa.rds")